The attention mechanism is designed to focus on different parts of the input data, depending on the context. In the Transformer model, the attention mechanism allows the model to focus on different words in the input sequence when producing an output sequence. The strength of the attention is determined by a score computed from a query and a key; that score then decides how much weight the corresponding value receives.
Given a Query (Q), a Key (K), and a Value (V), the attention mechanism computes a weighted sum of the values, where the weight assigned to each value is determined by the query and the corresponding key.
The attention score for a query Q and a key K is calculated as:
\text{score}(Q, K) = Q \cdot K^T
This score is then passed through a softmax function to get the attention weights:
\text{softmax}\left(\frac{\text{score}(Q, K)}{\sqrt{d_k}}\right)
Here d_k is the dimension of the key vectors; dividing the scores by \sqrt{d_k} helps stabilize the gradients.
Finally, the output is calculated as a weighted sum of the values:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
In practice, we don’t calculate attention for a single word, but rather for a set of words (i.e., a sequence). To do this efficiently, we use matrix operations.
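To make this concrete, here is a minimal NumPy sketch of scaled dot-product attention computed for a whole sequence at once; the function names, toy dimensions, and random weight matrices are illustrative assumptions, not part of the original text.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a whole sequence."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq_len_q, seq_len_k) attention scores
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V                    # weighted sum of the value vectors

# Toy example: a sequence of 4 "words" with model dimension 8 (illustrative sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)  # (4, 8)
```

Each row of the result is the attention-weighted mix of value vectors for one position in the sequence.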
In multi-head attention, the idea is to have multiple sets of Query, Key, Value weight matrices. Each of these sets will generate different attention scores and outputs. By doing this, the model can focus on different subspaces of the data.
Let's denote the number of heads as h.
Each head i has its own weight matrices: W_i^Q, W_i^K, and W_i^V.
For each head i (from 1 to h), compute the Query, Key, and Value matrices just as in single-head attention:
Q_i = X W_i^Q
K_i = X W_i^K
V_i = X W_i^V
Using the Q_i, K_i, and V_i matrices, we calculate the output for each head:
\text{Score}_i = Q_i K_i^T
\text{Scaled Score}_i = \frac{\text{Score}_i}{\sqrt{d_k}}
\text{Attention Weights}_i = \text{softmax}(\text{Scaled Score}_i)
\text{Output}_i = \text{Attention Weights}_i \, V_i
Now, after obtaining the output for each head, we need to combine these outputs to get a single unified output.
Concatenation & Linear Transformation: The outputs from all heads are concatenated and then linearly transformed to produce the final output:
This multi-head mechanism allows the Transformer to focus on different positions with different subspace representations, making it more expressive and capable of capturing various types of relationships in the data.